Move prediction cache to Learner. #5220
Conversation
Requires some more thought to make a clean design for the cache.
As an idea, what if the learner handles all prediction caching and when it needs to update the cache it calls predict only on the most recent tree. We remove all functions like "UpdatePredictionCache" and the caching mechanism is isolated to the learner. There might be a small loss of efficiency because we do not make use of the information in the Updater, but we still achieve algorithmic efficiency by predicting from only one tree per iteration.
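A minimal sketch of that idea, with all names hypothetical (plain `std::vector` and `std::function` stand-ins rather than XGBoost's actual types): the Learner-owned cache records how many trees it has already folded in and only evaluates the new ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Row = std::vector<float>;
using Tree = std::function<float(const Row&)>;  // one tree's margin for a row

// Hypothetical cache entry owned by the Learner. `version` counts how many
// trees have already been accumulated into `predictions`.
struct PredictionCacheEntry {
  std::vector<float> predictions;
  std::uint32_t version{0};
};

// Bring the cache up to date by predicting only the trees added since the
// last update, instead of re-running full prediction over the whole model.
void UpdateCache(PredictionCacheEntry* entry,
                 const std::vector<Tree>& trees,
                 const std::vector<Row>& data) {
  entry->predictions.resize(data.size(), 0.0f);
  for (std::uint32_t t = entry->version; t < trees.size(); ++t) {
    for (std::size_t i = 0; i < data.size(); ++i) {
      entry->predictions[i] += trees[t](data[i]);
    }
  }
  entry->version = static_cast<std::uint32_t>(trees.size());
}
```

With one new tree per boosting iteration, the inner work per update is a single-tree prediction, which is where the algorithmic efficiency comes from even without the Updater's internal information.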
Force-pushed from 04d4dc5 to 5e1d8c0.
Force-pushed from 63d001d to def3135.
Force-pushed from def3135 to 4f31bfb.
Force-pushed from a6f14ff to 0d544e7.
@hcho3 We talked about adding a custom segfault handler like https://github.com/bombela/backward-cpp. I added a simple one in this PR for debugging on Jenkins (and it was helpful), but it breaks the JVM package's error handling/recovery logic, so I will remove it once the PR is done. You may find it useful in the future as it's super simple (about 50 added lines).
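For reference, wiring up backward-cpp follows its README: defining a single `backward::SignalHandling` object installs the handlers (the build flags needed for symbol resolution vary by platform).

```cpp
// main.cc -- compile with -g and link the unwind/debug libraries described
// in the backward-cpp README so the trace has symbols.
#include "backward.hpp"

namespace {
// Constructing this object registers handlers for SIGSEGV, SIGABRT, and
// friends, which print a stack trace before the process dies.
backward::SignalHandling gSignalHandling;
}  // namespace

int main() {
  int* p = nullptr;
  return *p;  // deliberate segfault: the handler prints a backtrace
}
```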
Nice work, I think the approach is solid.
This PR is too large to review properly. The gblinear changes can be extracted. There are unnecessary formatting changes or changes not related to prediction caching. Changing the DMatrix pointer arguments to shared pointers could potentially be extracted.
I want to review the changes to caching logic more carefully and in parts. You can help by breaking things down as much as possible. As you know, this part of the code base is very bug-prone.
src/learner.cc
predictions_.SetDevice(generic_parameters_.gpu_id);
predictions_.Resize(predt.predictions.Size());
predictions_.Copy(predt.predictions);
Is there an extra copy here that wasn't there before?
@RAMitchell It's a copy moved from Predictor::PredictFromCache. I removed that function altogether as the cache is now managed by Learner, hence the copying is also done by Learner. It's one copy fewer, as I also removed the temporary vector in the C API.
If I use two differently sized evaluation matrices, the predictions vector will get resized twice every single iteration. This is a very common case.
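A toy illustration of the concern, with a plain `std::vector` standing in for the cached prediction vector; for a device-backed HostDeviceVector each of these alternating resizes can mean a fresh allocation and copy on the GPU.

```cpp
#include <cstddef>
#include <vector>

// One shared output buffer serving two matrices of different sizes is
// resized on every call, because the requested size keeps alternating.
void Predict(std::vector<float>* out, std::size_t num_rows) {
  out->resize(num_rows);
}

int main() {
  std::vector<float> shared;
  for (int iter = 0; iter < 10; ++iter) {
    Predict(&shared, 1000);  // training matrix
    Predict(&shared, 500);   // smaller evaluation matrix: shrink, grow, ...
  }
  return 0;
}
```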
src/learner.cc
this->PredictRaw(data.get(), predts, training, ntree_limit);
out_preds->SetDevice(generic_parameters_.gpu_id);
out_preds->Resize(predts->predictions.Size());
out_preds->Copy(predts->predictions);
Again, is there extra copying happening here?
Will revert some changes. I sometimes added formatting changes because clangd is warning with clang-tidy checks and it's quite difficult to focus.
This might not be possible, as we need the shared pointer for constructing …
You could do the shared pointers first. This way we get past some harmless changes before we look at the core logic.
Extracted from dmlc#5220.
Force-pushed from 545726d to cbd5a3c.
int old_ntree = model.trees.size() - num_new_trees;
// update cache entry
for (auto& kv : (*cache_)) {
It might be worth noting that before this PR, this function updated the cache for all DMatrix objects, including validation datasets. Now it only updates the training DMatrix, as I don't think the updater can/should have information about validation datasets in the future (otherwise that would be peeking, and would violate the purpose of validation).
* Clean-ups
  - Remove duplicated cache in Learner and GBM.
  - Remove ad-hoc fix of invalid cache.
  - Remove `PredictFromCache` in predictors.
  - Remove prediction cache for linear altogether, as it only moves the prediction into the training process but doesn't provide any actual overall speed gain.
  - The cache is now unique to Learner, which means the ownership is no longer shared by any other components.
* Changes
  - Add version to prediction cache.
  - Use weak ptr to check expired DMatrix.
  - Pass shared pointer instead of raw pointer.
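A rough sketch of what such a Learner-owned container could look like under those bullet points; the names here are hypothetical, not the PR's actual code. The `weak_ptr` lets the cache detect a freed DMatrix without keeping it alive, and `version` tracks how many trees the cached predictions include.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>
#include <vector>

struct DMatrix;  // stand-in for xgboost's DMatrix

struct PredictionCacheEntry {
  std::weak_ptr<DMatrix> ref;      // non-owning: detects an expired DMatrix
  std::vector<float> predictions;  // cached margins
  std::uint32_t version{0};        // number of trees already accumulated
};

class PredictionContainer {
 public:
  // Callers now pass shared pointers; the cache stores only a weak reference,
  // so it never extends the DMatrix lifetime.
  PredictionCacheEntry& Entry(std::shared_ptr<DMatrix> const& m) {
    auto& e = cache_[m.get()];
    e.ref = m;
    return e;
  }
  // Drop entries whose DMatrix has already been freed by the caller.
  void ClearExpired() {
    for (auto it = cache_.begin(); it != cache_.end();) {
      it = it->second.ref.expired() ? cache_.erase(it) : std::next(it);
    }
  }

 private:
  std::map<DMatrix const*, PredictionCacheEntry> cache_;
};
```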
Force-pushed from bea3258 to 2bce357.
Not sure why, putting …
If you use device vectors as static variables, they can get destructed after the CUDA context has been released, leading to errors.
This might make #5207 difficult to implement...
@RAMitchell Anyway, a problem for another day. I cleaned up the PR per your request. Formatting changes are reverted, but the change in gblinear is kept: as GBM no longer has access to the global prediction cache, the code removal for linear is necessary.
Looks very nice. Is using an integer for prediction state and begin/end ranges the most robust approach? You have already noted that this causes ambiguity when there are multiple output groups. In a perfect world we would have unique hashes for every tree generated and be able to see which trees have contributed to a prediction vector. Maybe we could get closer to this?
It looks to me like your version is smarter and only computes predictions that it needs, now that some state is carried along with the prediction vector. If you change the updaters to return false in UpdatePredictionCache, you could get an idea of the performance regression from removing these functions. Of course, we cannot do this until gpu_predictor is able to predict from Ellpack; otherwise we have memory allocation problems.
Regarding the resize concern: before this PR, Learner stored another prediction vector for each DMatrix (hence two prediction vectors per DMatrix, one for the cache and one as a temporary). I can restore that to avoid the resize. Another way around it is to modify the input to the metric into a … Update: I restored the additional vectors for each matrix to avoid any additional refactoring in this PR.
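A sketch of the restored layout under hypothetical names: each DMatrix keeps its own scratch vector next to the cached predictions, so buffer sizes stay stable once warm and two differently sized evaluation matrices never fight over a single shared buffer.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical per-matrix storage: cached margins plus a separate scratch
// vector handed to the metric.
struct PerMatrixBuffers {
  std::vector<float> cached_predictions;
  std::vector<float> metric_input;
};

std::vector<float>& MetricInput(
    std::map<const void*, PerMatrixBuffers>* buffers,
    const void* dmatrix, std::size_t rows) {
  auto& b = (*buffers)[dmatrix];
  b.metric_input.resize(rows);  // stable size per matrix: no-op when warm
  return b.metric_input;
}
```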
I don't have a better name than … As for the other problem regarding the refresh updater that I mentioned offline with @RAMitchell: it should be fine, as training continuation requires at least one serialization, and the cache is cleared there.
Codecov Report

@@           Coverage Diff           @@
##           master    #5220   +/-  ##
=====================================
  Coverage   83.76%   83.76%
=====================================
  Files          11       11
  Lines        2409     2409
=====================================
  Hits         2018     2018
  Misses        391      391

Continue to review the full report at Codecov.
LGTM.
Looking through the predict function, it seems like there are about four copies between generating the predictions and reaching the end user. I think we need to look at optimising this process; we can do so when considering prediction to device memory.
The other thing I think can be improved with the predict API is passing in a bunch of bools to change behaviour. I think there should be separate functions for separate functionality.
The copying is annoying. I don't have any thoughts on how to get rid of them yet.
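A sketch of what the suggested split might look like; the class and function names are hypothetical, though the bool flags in the comment mirror the existing predict options (output_margin, pred_leaf, pred_contribs).

```cpp
#include <vector>

struct DMatrix;  // stand-in for the real class

// Instead of one entry point steered by a pile of bools, e.g.
//   Predict(dmat, output_margin, pred_leaf, pred_contribs, ...);
// give each behaviour its own function:
class Predictor {
 public:
  std::vector<float> PredictValue(DMatrix*)         { return {}; }
  std::vector<float> PredictMargin(DMatrix*)        { return {}; }
  std::vector<int>   PredictLeaf(DMatrix*)          { return {}; }
  std::vector<float> PredictContributions(DMatrix*) { return {}; }
};
```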
Clean-ups

- Remove `PredictFromCache` in predictors.

Changes

Risks
Before, the DMatrix was tied to the Booster with a shared pointer, which meant access to it was always safe as it couldn't expire before the booster. Now we need to ensure that safety ourselves.
Advantages
As we no longer store the shared pointer, this should reduce memory usage when doing grid search with libraries like Scikit-Learn (during evaluation the training matrix can be released). Also, performance for training continuation (incremental training, checkpoint recovery) will be significantly better, as we cache all encountered DMatrix objects.
Lastly, this PR is intended to ease the development of our DMatrix refactoring plan, which requires an easily accessible and reliable prediction cache so that we can use quantile values as a proxy for the actual data.
Related issues and PRs:
PRs:
#5272
Issues:
#3946
#4786
#4774
#4482